Proteomics-driven noninvasive screening of circulating serum protein panels for the early diagnosis of hepatocellular carcinoma

Early diagnosis of hepatocellular carcinoma (HCC) lacks highly sensitive and specific protein biomarkers. Here, we describe a staged mass spectrometry (MS)-based discovery-verification-validation proteomics workflow to explore serum proteomic biomarkers for early HCC diagnosis in 1002 individuals. A machine learning model built on the P4 panel (HABP2, CD163, AFP and PIVKA-II) clearly distinguished HCC from liver cirrhosis (LC; AUC 0.979, sensitivity 0.925, specificity 0.915) and from healthy individuals (HC; AUC 0.992, sensitivity 0.975, specificity 1.000) in an independent validation cohort, outperforming existing clinical prediction strategies. Furthermore, the P4 panel accurately predicted LC to HCC conversion (AUC 0.890, sensitivity 0.909, specificity 0.877), predicting HCC a median of 11.4 months before imaging in prospective external validation cohorts (No.: Keshen 2018_005_02 and NCT03588442). These results suggest that proteomics-driven serum biomarker discovery provides a valuable reference for liquid biopsy and has great potential to improve early diagnosis of HCC.


Introduction
Hepatocellular carcinoma (HCC) ranks fourth in cancer mortality worldwide, and chronic cirrhosis caused by hepatitis viruses (mainly hepatitis B and C viruses) and metabolic diseases (mainly alcoholic liver disease and diabetes) is the major risk factor for HCC [1,2]. Although surgery remains an effective therapy for HCC patients according to the HCC treatment guidelines, most patients are diagnosed at an advanced clinical stage because early symptoms are lacking, and thus suffer from poor outcomes [3]. Early screening and diagnosis of HCC therefore remain a clinical dilemma.
Current strategies for HCC diagnosis include imaging (CT/MRI), serum protein biomarkers (alpha-fetoprotein (AFP) and protein induced by vitamin K absence or antagonist-II (PIVKA-II, also known as des-gamma-carboxy prothrombin (DCP))) and histopathology, all of which struggle to accurately diagnose early-stage HCC because of empirical limitations, restricted sensitivity or invasive detection modalities [4,5]. Serum and plasma are routinely collected from patients with liver symptoms and reflect changes in liver function, making them ideal for liquid biopsy: they are safe, simple and suitable for large populations with long-term follow-up [6,7]. Many circulating liquid biopsy tumor biomarkers, such as ctDNA [8], cfDNA [9,10], metabolites [11] and proteins [12][13][14], are being developed rapidly. Plasma or serum proteins, as the ultimate bearers and effectors of human biological activities, are common study objects in biomarker development. The FDA has approved over 100 plasma or serum proteins, and some serum protein biomarkers have been tested in long-term clinical applications [15,16]. Therefore, system-wide discovery of serum protein biomarkers for early diagnosis of HCC would be very attractive, and these global data could be used to build machine learning-based classification models for the early diagnosis of HCC.
Mass spectrometry (MS)-based proteomics is in principle an ideal tool for biomarker discovery. However, proteomic analysis of serum or plasma has been challenging because of low protein concentrations and a wide dynamic range of protein abundances, resulting in low quantification precision, low throughput and limited proteome depth [16][17][18]. Recent advances in MS-based proteomics have greatly improved the depth and breadth of serum and plasma proteome coverage and extended its impact in biomedical and clinical studies [19]. Data-independent acquisition-based MS (DIA-MS) can effectively avoid the masking effect of high abundance proteins (HAP) on low abundance proteins (LAP) and improve detection efficiency and sample reproducibility; it has therefore been widely used in the development of tumor serum biomarkers [20]. Furthermore, circulating proteomic panels for diagnosis and risk stratification of various tumors have been developed using the targeted proteomic strategy, avoiding the restrictions of antibodies [21,22]. Its specificity in identifying and quantifying hundreds or even thousands of proteins in serum or plasma samples makes it suitable in principle for the identification and validation of biomarkers. Many research groups have performed biologically meaningful proteomic studies in clinical samples from various clinical cohorts using the DIA + PRM workflow [23][24][25][26]. For HCC biomarker discovery, however, the DIA + PRM strategy has only been applied to very small clinical cohorts, and screening and validation studies in large clinical cohorts are lacking [27].
In this study, using serum as a liquid biopsy, we performed a staged MS-based discovery-verification-validation proteomics workflow in 662 individuals to screen HCC diagnosis biomarkers, from which a biomarker panel was developed by machine learning for the diagnosis of HCC patients. The clinical significance of this panel for early diagnosis and even early prediction of HCC was then evaluated in a prospective cohort. The aim of this research was to reveal the changes in serum proteins in HCC patients, to discover valuable serum protein biomarkers for early diagnosis of HCC, and to provide a valuable data resource for HCC research.

Study design and clinical characteristics of serum specimens
To systematically identify and validate potential noninvasive protein biomarkers for HCC diagnosis in serum, we performed a staged MS-based discovery-verification-validation proteomics workflow (Fig. 1). For the discovery cohort, 320 individuals, including patients with HCC (n = 163), liver cirrhosis (LC, n = 53) and basic liver diseases (BLD, n = 64, including 16 chronic hepatitis B (CHB), 18 alcoholic liver disease (ALD) and 30 non-alcoholic fatty liver disease (NAFLD) samples) and chronic asymptomatic hepatitis B virus carriers (AsC, n = 40), were included for DIA-MS quantitative proteomic analysis. The detailed clinical information is shown in Table S1. There were no statistically significant differences in routine indicators such as gender among patients in different groups (Table S2). However, the indicators reflecting the severity of liver function decompensation differed significantly among the 4 groups, which was consistent with the progression of the disease from benign to malignant. The validation cohort consisted of 210 HCC and 132 LC patients (including 17 LC patients who developed HCC during follow-up) for targeted quantification of the candidate biomarkers by PRM-MS. Machine learning models based on the early diagnosis panel for HCC were then developed and used for the prediction of HCC risk.

Proteomic characterization of serum samples
We performed proteome profiling of serum samples using a high-throughput DIA-MS-based proteomics strategy (Fig. 2A). To maximize proteome depth and coverage, we generated a hybrid spectral library consisting of 128 fractions of pooled serum samples from DDA and 320 individual serum samples from DIA. The hybrid spectral library contained 875 proteins, of which 82 proteins were detected only in the DIA-MS data (Figure S1A). The majority of the library proteins (85.94%; 752/875) were reported in the Plasma Proteome Database (PPD, http://plasmaproteomedatabase.org/) (Figure S1B). Using this spectral library, the DIA-MS analysis yielded 451 quantifiable proteins, more than half of the proteins in the library (51.5%; 451/875) (Fig. 2B, Table S3). On average, we quantified 300 (AsC), 278 (BLD), 289 (LC) and 304 (HCC) proteins per serum sample in a single run (Fig. 2C). Our DIA workflow achieved serum proteome coverage comparable with previous studies that applied a similar single-run strategy without depleting serum HAPs [27]. The quantified serum protein intensities spanned over 4 orders of magnitude, and the top 10 most abundant proteins accounted for about 80% of the serum proteome signal, illustrating the challenge of the analysis (Fig. 2D).

Assessment of quantification precision and sample quality
To assess quantitative precision in our study, we investigated the variability of our workflow by repeatedly measuring a HeLa cell protein digest standard sample throughout the process, including DDA-MS and DIA-MS. The quality-control analysis of the replicates showed high technical reproducibility for DDA-MS and DIA-MS, with an average number of quantified proteins of 3591 and 4120 (Fig. 3A), coefficients of variation (CVs) of 0.31 and 0.18 (Fig. 3B), and correlation coefficients of 0.952 and 0.935 in DDA-MS and DIA-MS, respectively (Figure S1E-F).
In contrast, the CV values of the 4 groups were significantly higher than those of the standard (Fig. 3C), and the correlation of identified proteins among the 4 groups revealed high heterogeneity within the LC group and within the HCC group (Figure S1C). Consistent with clinical experience, AFP and PIVKA-II were higher in HCC than in non-HCC patients (Figure S1D), and the MS quantitative proteomic results for AFP showed a high correlation with clinical antibody-based assays (Fig. 3D-E). When we used the Youden index threshold to classify MS-based AFP quantitation as positive or negative, 82.2% of the patients were consistent with the results defined by the clinical AFP or PIVKA-II antibody assay (Fig. 2F). These results strongly affirmed the high quality of our proteomic data.
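The Youden index step mentioned above can be illustrated with a minimal sketch. It is not the authors' code: the function names, the "value >= threshold means positive" convention and the toy intensities are all assumptions made for illustration. The threshold is chosen to maximize J = sensitivity + specificity - 1, and agreement with a reference assay is then the fraction of matching calls.

```python
def youden_threshold(values, labels):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    A sample is called positive when its value is >= threshold.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes are required")
    best_threshold, best_j = None, -1.0
    for threshold in sorted(set(values)):
        tp = sum(1 for v, y in zip(values, labels) if y == 1 and v >= threshold)
        tn = sum(1 for v, y in zip(values, labels) if y == 0 and v < threshold)
        j = tp / positives + tn / negatives - 1.0
        if j > best_j:  # keep the lowest threshold on ties
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


def agreement(calls_a, calls_b):
    """Fraction of samples on which two binary call vectors agree."""
    return sum(1 for a, b in zip(calls_a, calls_b) if a == b) / len(calls_a)


# Hypothetical toy data, not study data: 1 = reference-positive, 0 = negative.
ms_afp = [2.0, 3.0, 3.5, 5.0, 6.0, 7.5, 8.0, 9.0]
reference = [0, 0, 0, 1, 0, 1, 1, 1]

cutoff, j_stat = youden_threshold(ms_afp, reference)
ms_calls = [1 if v >= cutoff else 0 for v in ms_afp]
concordance = agreement(ms_calls, reference)
```

On real data the same two functions would be applied to the MS-derived AFP intensities and the clinical assay calls; the concordance value then corresponds to the kind of percentage reported in the text.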

Differentially abundant proteins and functional alterations related to HCC
To further screen for meaningful diagnostic biomarkers for HCC, 250 immunoglobulins were excluded from further analyses. A total of 17 up-regulated and 17 down-regulated proteins that differed in the HCC/AsC, HCC/BLD and HCC/LC comparisons were used for further analysis (Figure S2A-B, Table S4). The expression profiles of these proteins clearly showed intergroup differences and trended with disease severity, with the most dramatic differences in the HCC group (Fig. 4A-B).
As expected, most proteins were located in the extracellular space, extracellular exosome, extracellular region and blood microparticle, which was consistent with the characteristics of serum proteins. These dysregulated proteins were mainly enriched in biological processes of immunity and inflammation, as well as in molecular functions associated with the activation of multiple receptors and various enzymatic activities related to tumorigenesis and development. Moreover, the enriched pathways included complement and coagulation cascades, NOD-like receptor signaling pathway, NF-kappa B signaling pathway, Toll-like receptor signaling pathway, TNF signaling pathway and leukocyte transendothelial migration, indicating that HCC was likely to promote its own development by regulating a variety of receptors or pathways related to immunity and inflammation (Fig. 4C).
In addition, PPI analysis revealed 3 highly connected clusters involving cell proliferation and apoptosis (blue), cell adhesion and recognition (red), and complement activation and innate immunity (green) (Fig. 4D). In the blue and green clusters, the abundance of up-regulated proteins changed more dramatically, suggesting that these proteins might play a dominant role in HCC proliferation, development and migration. In the green cluster, the abundance of most proteins decreased, suggesting that HCC might exert some inhibitory regulation on the immune system. Candidate proteins for further validation were mainly selected from these three clusters.

Verification of serum candidate biomarkers using PRM-based targeted MS
Based on the HCC-related proteomic and functional alterations revealed in the discovery study, we then sought to develop protein biomarkers that reflect HCC occurrence with high accuracy. Firstly, the LVQ model was used to evaluate the diagnostic performance of the 34 HCC-related differentially abundant proteins; by comparing the accuracy of each protein in identifying HCC patients, 15 proteins with accuracy higher than 0.8 were selected (Fig. 5A). Secondly, the candidate proteins were required to have unique peptides and a good matched peptide profile in the DIA-MS data. Finally, 11 candidate biomarkers with unique peptides and aberrant abundance in HCC were proposed for further targeted proteomics analysis (Fig. 5B, Table S5). To verify the candidate biomarkers, we further validated the abundance of their matched peptides in the validation cohort containing 210 HCC and 132 LC patients by PRM-MS. Five peptides could be quantified with more than 3 pairs of ions matched between the light and heavy labels, and their quantification differed significantly between the HCC and LC groups (adjusted p-value < 0.05), which was consistent with the trend of the DIA-MS results (Fig. 5C, Figure S3, Table S6). Thus, these 5 proteins were used in different combinations to construct HCC diagnostic models.
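The single-protein screening step (keep proteins whose LVQ accuracy exceeds 0.8) can be sketched as follows. This is a deliberately simplified stand-in, not the caret implementation used in the study: it is a one-feature LVQ1 with one prototype per class, accuracy is measured on the training data, and the protein names and intensities are hypothetical.

```python
def train_lvq1(values, labels, learning_rate=0.1, epochs=20):
    """Train a two-prototype LVQ1 classifier on a single feature.

    Prototypes start at the class means. The nearest prototype is pulled
    toward a sample of its own class and pushed away from a sample of the
    other class, with a linearly decaying learning rate.
    """
    prototypes = {}
    for cls in (0, 1):
        members = [v for v, y in zip(values, labels) if y == cls]
        prototypes[cls] = sum(members) / len(members)
    for epoch in range(epochs):
        rate = learning_rate * (1.0 - epoch / epochs)
        for v, y in zip(values, labels):
            nearest = min(prototypes, key=lambda c: abs(v - prototypes[c]))
            step = rate * (v - prototypes[nearest])
            prototypes[nearest] += step if nearest == y else -step
    return prototypes


def predict(prototypes, value):
    """Assign the class of the nearest prototype."""
    return min(prototypes, key=lambda c: abs(value - prototypes[c]))


def accuracy(prototypes, values, labels):
    hits = sum(1 for v, y in zip(values, labels) if predict(prototypes, v) == y)
    return hits / len(values)


# Hypothetical toy intensities: 0 = non-HCC, 1 = HCC.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
proteins = {
    "protein_A": [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1],  # well separated
    "protein_B": [1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0],  # uninformative
}

scores = {}
for name, values in proteins.items():
    model = train_lvq1(values, labels)
    scores[name] = accuracy(model, values, labels)

# Keep proteins whose single-feature accuracy exceeds the 0.8 cut-off.
selected = [name for name, acc in scores.items() if acc > 0.8]
```

A fuller analysis would estimate accuracy by cross-validation rather than on the training samples, which is what a caret-style workflow typically does.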

Machine learning-based classification of HCC patients and LC patients
To screen the best panel for HCC diagnosis, 130 HCC and 68 LC patients with PRM quantitative data for candidate serum proteins were used to construct a random forest predictive model and to calibrate the cut-off. A further 80 HCC and 47 LC patients from the validation set were then used to assess the reliability of the model externally. We compared the area under the ROC curve (AUC) for the 5 potential biomarkers and their different combinations in the validation set (Table S7). This process resulted in a panel of HABP2 and CD163 with the highest AUC (0.935), sensitivity (0.838) and specificity (0.872) for differentiating HCC from LC patients, which maintained effective diagnosis in HCC patients who were negative for AFP (< 20 ng/mL), negative for PIVKA-II (< 40 mAU/mL) or negative for both AFP and PIVKA-II (Figure S4A-D).
Next, because of the low sensitivity of existing diagnostic strategies, we determined the performance of the P4 panel in the diagnosis of HCC at the early stage. As the best performing existing serum biomarker combination in the diagnosis of HCC, AFP + PIVKA-II was used for comparison with the P4 panel. The P4 panel had higher and more stable sensitivity than AFP + PIVKA-II in different clinical stages, especially in early HCC clinical stages such as TNM stage I (0.875 vs 0.750), BCLC stage 0-A (0.902 vs 0.754) and CNLC stage I (0.878 vs 0.683) (Fig. 6G, Table S8). This suggested that the P4 panel has good predictive value for the early diagnosis of HCC.
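For readers less familiar with the metrics reported in this section, the sketch below shows how AUC, sensitivity and specificity are computed from model scores. The scores, labels and the 0.5 cut-off are hypothetical values chosen for illustration, and the pairwise (Mann-Whitney) AUC formulation is one standard choice rather than a description of the authors' software.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half (Mann-Whitney formulation)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(scores, labels, cutoff):
    """Sensitivity and specificity when score >= cutoff is called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < cutoff)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cutoff)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical panel scores: 1 = HCC, 0 = LC.
panel_scores = [0.1, 0.3, 0.35, 0.6, 0.4, 0.7, 0.8, 0.9]
truth = [0, 0, 0, 0, 1, 1, 1, 1]

auc = roc_auc(panel_scores, truth)
sens, spec = sensitivity_specificity(panel_scores, truth, cutoff=0.5)
```

The cut-off would normally be fixed on the training set and then applied unchanged to the validation set, as the text describes.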

The P4 panel accurately predicted conversion of LC to HCC earlier
It is well known that imaging remains the gold standard for the diagnosis of HCC compared with the clinical protein biomarkers AFP and PIVKA-II and other reported score-based models. To assess the efficacy of the P4 panel in detecting early-stage HCC and to compare it with other commonly used methods, we recruited 132 LC patients into a prospective clinical cohort and collected imaging data, PRM quantitative results of HABP2 + CD163, traditional protein biomarker results and the widely accepted aMAP risk score (including age, male, albumin-bilirubin and platelet data) at a series of follow-up time points, with development of HCC as the final end-point. As expected, the P4 panel was effective in the diagnosis of early-stage HCC, with the highest sensitivity and AUC, outperforming AFP, PIVKA-II and the aMAP score (Fig. 7A, Table S9). The P4 panel scores were significantly higher for LC patients who subsequently developed HCC than for LC patients who did not (p < 0.0001) (Fig. 7B). Notably, all LC patients (100%, 17/17) who subsequently developed HCC were detected accurately (Table S10), a rate higher than that of AFP (47.1%, 8/17), PIVKA-II (29.4%, 5/17), AFP + PIVKA-II (88.2%, 15/17) and the aMAP score (76.5%, 13/17), suggesting that the P4 panel might be a good predictor for LC patients at risk of developing HCC (Fig. 7C, Figure S6A-D). However, 27.3% (36/132) of LC patients had high scores but had not developed HCC by the end of follow-up, suggesting that the P4 panel, like other strategies, carries an unavoidable false-positive rate.
Notably, the P4 panel detected all LC patients who subsequently developed HCC (100%, 17/17), fully consistent with the imaging diagnosis and 2.8 to 34.5 months earlier than imaging, with a median of 12.6 months (Fig. 7D-E). In contrast, the proportion of these patients with abnormal AFP (47.1%, 8/17), abnormal PIVKA-II (29.4%, 5/17) or higher aMAP scores (76.5%, 13/17) was lower than that predicted by the P4 panel. Furthermore, the P4 panel had better concordance with positive imaging findings during follow-up than AFP, PIVKA-II and aMAP scores (94.1% vs. 52.9%, 52.9% and 70.6%), with 1 of the patients missing PRM data at diagnosis of HCC (Fig. 7F-G). These findings suggest that the P4 panel could be a promising predictor of conversion to HCC in LC patients compared with traditional protein biomarkers or other score-based models.

Discussion
Many biomarkers based on liquid biopsy have been developed for cancer diagnosis or monitoring, for example methylated DNA [28][29][30], ctDNA [31,32] and microRNA [33,34]. However, nucleic acid biomarkers are constrained by cost and time in clinical applications because of their complex detection methods and high requirements for sample preservation and handling [35]. Proteins, on the other hand, have good stability and can be easily developed into biomarkers for clinical applications [36]. However, MS-based screening of serum protein biomarkers has unique challenges because of the interference of high abundance proteins [37]. In this study, we replaced traditional DDA-MS with DIA-MS, which eliminated the need for cumbersome and expensive separation and fractionation of abundant proteins in individual serum samples [38,39]. Although the depth of our serum proteome could be improved, we detected hundreds of proteins that are not available in the human plasma proteome database, e.g. serum amyloid A-1 protein (SAA) and proline-rich acidic protein 1 (PRAP1). The high dynamic range of protein abundance in serum limits the sensitivity of MS-based proteomics, but the median CV of 18% in our assay is well below the biological variation. Furthermore, the PRM-MS targeted proteomics validation method improves the accuracy of high-throughput validation, can be used for cost-effective antibody-free and batched validation of candidate biomarkers, and could be further optimized for clinical translation [40,41]. Therefore, our workflow is well suited to studying tumor-related serum protein variations at a proteomic scale and provides an important resource for screening early diagnostic biomarkers for HCC.
In this study, we developed and validated a 4-serum-protein panel for early diagnosis of HCC with sensitivity 0.925, specificity 0.915 and AUC 0.979. In an independent validation cohort, the panel was able to identify occult HCC that was not observed by imaging with 100% accuracy, although there was a certain rate of false positives. The panel's identification of high-risk individuals who were diagnosed about 1 year later through standard diagnostic methods demonstrated its utility for HCC screening and its potential for detecting high-risk populations through such screening. Therefore, we propose that facile and scalable analyses of serum proteins based on serum proteomics could be used to prescreen high-risk populations for HCC, increasing the accessibility of HCC detection and reducing unnecessary follow-up imaging procedures and invasive biopsies. In the current field of HCC screening, we see three ways in which the panel could be integrated into clinical practice in the future: (1) the panel can be used as a complementary test for patients who are AFP-negative or PIVKA-II-negative but have high-risk clinical factors; (2) the panel can be used as a basis for further screening in patients who refuse imaging screening; (3) the panel can improve the detection rate of early HCC and serve as an alternative screening method for people at high risk of HCC.
We acknowledge several limitations of this study. Firstly, multi-center and large-scale prospective clinical cohorts are still needed to verify the generalizability of our model, including its sensitivity, specificity and accuracy. Secondly, there was a lack of healthy controls in this study, and more healthy samples are needed to confirm the specificity of the assay in future studies. Thirdly, whether the P4 panel can enable early diagnosis of HCC in clinical routine needs further testing. Finally, we did not directly assess the model in patients with other cancer types, which would determine whether the model is specific to HCC.
In summary, our study presented an effective mass spectrometry (MS)-based proteomics workflow for the discovery and validation of serum biomarkers for early diagnosis of HCC. We developed a panel that predicts the conversion of LC to HCC earlier and more accurately than existing clinical methods, which may provide a useful reference for the early diagnosis of HCC.

Patient cohorts and sample collection
To construct the discovery cohort, serum samples from 40 chronic asymptomatic hepatitis B virus carriers (AsC), 64 patients with basic liver diseases (BLD) (including 16 chronic hepatitis B (CHB), 18 alcoholic liver disease (ALD) and 30 non-alcoholic fatty liver disease (NAFLD) samples), 53 patients with liver cirrhosis (LC) and 163 patients with hepatocellular carcinoma (HCC) were enrolled. The inclusion criteria for HCC patients were: 1) histopathologically confirmed hepatocellular carcinoma; 2) radical hepatectomy performed [2]; 3) no preoperative anti-cancer treatment; 4) complete physio-biochemical and clinicopathological data. Patients with CHB, ALD, NAFLD and LC required physiological and imaging evidence for the diagnosis. Detailed physiological and biochemical indicators for all patients, as well as clinicopathological characteristics for HCC patients, can be found in Table S1.
Serum samples of 210 HCC and 132 LC patients were further collected for the validation of candidate biomarkers by PRM-MS. Detailed inclusion and exclusion criteria of the patients were the same as for the discovery cohort.
The 132 LC patients were prospectively enrolled and followed up every 6 months, with AFP, liver function tests, abdominal ultrasound, a contrast-enhanced computed tomography (CT) scan or magnetic resonance imaging (MRI) of the abdomen, and a chest CT performed at each visit. The diagnosis of HCC followed the strict criteria of the European Association for the Study of the Liver (EASL). The average follow-up time of LC patients was 45.4 months, and 17 LC patients were diagnosed with HCC during subsequent follow-up. The median time from enrollment to progression to HCC for these LC patients was 12.6 months.

All samples were collected from clinical specimen banks of Mengchao Hepatobiliary Hospital of Fujian Medical University. Serum samples were collected using an intravenous tube without anticoagulant and coagulated at room temperature for 30 min, then centrifuged at 3,000 rpm for 10 min. The supernatant serum was collected and frozen at -80°C for subsequent use.

This project was approved by the Institutional Review Board of Mengchao Hepatobiliary Hospital of Fujian Medical University. Informed consent was obtained from each participant before the operation. The use of clinical specimens was in full compliance with the "Declaration of Helsinki".

Separation of LAPs and HAPs for serum samples
To construct the spectral library for DIA-MS, 40 samples at a time were randomly selected and mixed equally into one pool, so that the 320 samples were divided into 8 pools in total. HAPs and LAPs in serum were separated by high performance liquid chromatography (HPLC) using a Human 14 Multiple Affinity Removal System Column (Agilent Technologies, Santa Clara, CA, USA) according to the manufacturer's instructions. The collected LAP and HAP fractions were each concentrated into one tube using a 3 K cutoff centrifugal filter, and protein concentration was measured using the BCA assay [42].

Protein digestion
Protein digestion of serum samples was performed using a modified Filter-Aided Sample Preparation (FASP) method as described previously [43]. To prepare the LAP and HAP peptides for MS detection, 500 µg protein samples were diluted with 400 µL lysis buffer (6 M urea and 1× protease inhibitor cocktail). Then, 8 mM dithiothreitol (DTT) was added for reduction for 30 min at 55°C, followed by 50 mM iodoacetamide (IAA) for alkylation for 30 min in the dark at room temperature. The protein solutions were then washed twice with 100 mM triethylammonium bicarbonate (TEAB) in a 3 K cutoff centrifugal filter and digested using trypsin at a ratio of 1:50 (w/w) for 18 h at 37°C. Digested peptides were centrifuged at 14,000 g for 30 min and collected in the outer tube of the filter. Finally, the digested peptides were eluted and evaporated to dryness for LC-MS/MS analysis.
For peptide preparation for DIA-MS and PRM-MS, 10 µL of each individual serum sample was diluted by adding 200 µL lysis buffer, and 200 µg of protein was prepared using the same method as described above for subsequent MS analysis.

High pH reversed-phase separation
LAPs and HAPs were fractionated on an offline LC system (Acquity UPLC, Waters, USA) via high pH (pH = 10) separation, which was performed on a C18 reverse phase column (2.1 mm × 50 mm, 1.7 µm, catalog No. 186002350, Waters, USA) with a flow rate of 10 µL/min. Peptide mixtures were resuspended in mobile phase A and eluted with the following linear gradient: 0-5 min, 100% mobile phase A; 5-20 min, 93% mobile phase A; 20-25 min, 65% mobile phase A. Mobile phase A was H2O with 0.1% FA and mobile phase B was ACN with 0.1% FA. Starting at 1 min, one tube of fractions was collected every 30 s. Thus, 48 tubes of LAP or HAP fractions were collected for each 25 min gradient. To reduce the MS detection time, we mixed fractions collected at the same time interval. For example, fractions collected at 1, 4, 7, 10, 13 and 16 min were mixed into 1 tube. Finally, a total of 8 samples were dried by vacuum centrifugation for proteomic analysis.

Data acquisition by DDA mass spectrometry
For spectral library construction, 1 µg of each fraction was loaded onto the nano liquid chromatography system (Easy-nLC 1000, Thermo, USA), which was coupled to a Trap Quadrupole-Orbitrap mass spectrometer (Q Exactive Plus, Thermo, USA). Briefly, the peptides were resuspended in mobile phase C (0.1% FA in water) and equal amounts of indexed retention time (iRT) peptide standards (Biognosys, Switzerland) were spiked into each sample. The peptides were separated on the C18 analytical column (75 µm × 250 µm, 1.8 µm, catalog No. 164534, Thermo Fisher Scientific, USA) with a 70 min gradient at a constant flow rate of 300 nL/min (0-3 min, 4 to 7% of mobile phase D; 3-45 min, 7 to 14% of mobile phase D; 45-60 min, 14 to 30% of mobile phase D; 60-70 min, 30 to 90% of mobile phase D, then held at 90% mobile phase D for 15 min. Mobile phase C was 0.1% FA in water and mobile phase D was 0.1% FA in ACN). The mass spectrometer was operated in DDA mode with 1.9 kV electrospray voltage at the inlet. The DDA scheme included a full MS scan from 300 to 1,800 m/z at a resolution of 70,000 (at m/z of 200) using an AGC target value of 3E6. The 15 most intense precursors were selected for MS/MS by high energy collisional dissociation (HCD) with 27% normalized collision energy. MS/MS spectra were acquired at a resolution of 17,500 (at m/z of 200) using an AGC target value of 1E5 and a maximum injection time (IT) of 45 ms. Dynamic exclusion was applied with a repeat count of 1 and an exclusion time of 30 s.

Construction of spectral library and analysis of DIA-MS
For spectral library construction, both DDA and DIA files were processed using Spectronaut (version X, Biognosys, Switzerland) [44]. The background database was built from the FASTA file of Homo sapiens containing 20368 reviewed proteins (published by UniProt in 2020), combined with the fusion sequence of iRT. The digestion enzyme was trypsin/P and a maximum of 2 missed cleavages was allowed. Carbamidomethyl was set as a fixed modification, and acetyl (protein N-term) and oxidation were set as variable modifications. The false discovery rate (FDR) was set to 1% at the peptide precursor level and 1% at the protein level. For the quantitative analysis of proteins across the 320 serum samples, the 320 DIA raw data files were searched against the hybrid spectral library, followed by quantification in Spectronaut Pulsar X. The Q-value cutoffs for protein and precursor were both set to 0.01.

Quality control of the mass spectrometry platform
To evaluate the performance of the mass spectrometry systems, HeLa standard peptides (Pierce, USA) were measured throughout the project as the quality-control standard. Quality-control runs were interspersed after every 2 experimental samples in DDA analysis and after every 10 experimental samples in DIA-MS analysis. The standard was analyzed using the same method and conditions as those used for the HCC-related serum samples. Pearson's correlation coefficient was calculated for all quality-control runs using the corrplot package of R (version 4.0.2).
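The two reproducibility measures used for the quality-control runs, the coefficient of variation (CV) across replicate injections and Pearson's correlation between runs, can be sketched as below. The study used R; this Python version is only an illustration, and the protein names and intensities are hypothetical.

```python
import math
import statistics


def coefficient_of_variation(intensities):
    """Sample standard deviation divided by the mean of replicate intensities."""
    return statistics.stdev(intensities) / statistics.mean(intensities)


def pearson(x, y):
    """Pearson correlation coefficient between two equally long runs."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical replicate intensities of three proteins in three QC runs.
qc_runs = {
    "protein_1": [9.0, 10.0, 11.0],
    "protein_2": [10.0, 10.0, 10.0],
    "protein_3": [8.0, 10.0, 12.0],
}

per_protein_cv = {name: coefficient_of_variation(v) for name, v in qc_runs.items()}
median_cv = statistics.median(per_protein_cv.values())

# Correlation between the first and second QC run across proteins.
run_1 = [v[0] for v in qc_runs.values()]
run_2 = [v[1] for v in qc_runs.values()]
r_example = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Whether CVs are computed on raw or log-transformed intensities changes the numbers considerably; the source does not state which was used, so the raw-scale choice here is an assumption.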

Data processing for serum proteomics data
The pre-processing of the proteomic data was performed using the wkomics analysis platform (https://omicsolution.org/wkomics/main/) [45]. For proteins with ≥ 40% completeness in HCC or non-HCC samples, missing values were filled using the SeqKNN method; when completeness was < 40% in both groups, missing values were filled with a minimal value. Median normalization and log2 transformation were then performed for subsequent data analysis.
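The pre-processing rule above can be sketched as a small decision function plus the two transformations. This is not the wkomics implementation: SeqKNN itself is not reproduced here (only the decision of when it applies), and both the meaning of "minimal value" (a supplied floor) and the form of median normalization (dividing each sample by its own median) are assumptions.

```python
import math
import statistics


def completeness(values):
    """Fraction of non-missing entries (missing values are None)."""
    return sum(v is not None for v in values) / len(values)


def choose_imputation(hcc_values, non_hcc_values, min_fraction=0.4):
    """Return 'seqknn' if either group is at least 40% complete, else 'minimum'."""
    best = max(completeness(hcc_values), completeness(non_hcc_values))
    return "seqknn" if best >= min_fraction else "minimum"


def fill_with_minimum(values, floor):
    """Replace missing entries with a supplied minimal value (assumed floor)."""
    return [floor if v is None else v for v in values]


def median_normalize_log2(sample):
    """Divide one sample's intensities by its median, then take log2."""
    med = statistics.median(sample)
    return [math.log2(v / med) for v in sample]


# Hypothetical protein measured in five HCC and five non-HCC samples.
sparse = choose_imputation([1.0, None, None, None, None], [None] * 5)
dense = choose_imputation([1.0, 2.0, None, None, None], [None] * 5)
filled = fill_with_minimum([1.0, None, 3.0], floor=0.5)
normalized = median_normalize_log2([1.0, 2.0, 4.0])
```

In a real pipeline the 'seqknn' branch would call a KNN-based imputation routine; it is left out here to avoid presenting a substitute algorithm as the one used in the study.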

Identification and functional analysis of HCC-related differentially abundant proteins
Differentially abundant proteins were identified using an independent-sample t-test with Benjamini-Hochberg multiple test adjustment in R 4.0.2. Proteins with P < 0.05 and FC ≥ 1.2 were considered differentially abundant. Gene ontology (GO) analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis were used to enrich the functions and pathways of HCC-related differentially abundant proteins. The statistical significance of the enrichment was evaluated by the hypergeometric test with Benjamini-Hochberg multiple test adjustment. Protein-protein interaction (PPI) network analysis was performed using the online analysis tool STRING (https://cn.string-db.org/). Protein clustering in the PPI network was based on the K-means algorithm, and the network diagram was visualized using Cytoscape.
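The multiple-testing adjustment and fold-change filter can be illustrated with the sketch below. It starts from p-values that are assumed to have been computed already (the t-test itself is omitted), and it treats the fold-change criterion as symmetric (FC ≥ 1.2 or FC ≤ 1/1.2) so that both up- and down-regulated proteins pass; that symmetric reading is an assumption, as is applying the 0.05 threshold to the adjusted p-value.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def is_differential(mean_case, mean_control, adj_p, fc_cut=1.2, alpha=0.05):
    """Apply the adjusted p-value and symmetric fold-change criteria."""
    ratio = mean_case / mean_control
    return adj_p < alpha and (ratio >= fc_cut or ratio <= 1.0 / fc_cut)


# Hypothetical raw p-values for four proteins.
raw_p = [0.01, 0.04, 0.03, 0.005]
adj_p = benjamini_hochberg(raw_p)

up = is_differential(3.0, 2.0, adj_p[0])      # FC 1.5, adjusted p 0.02
flat = is_differential(2.1, 2.0, adj_p[0])    # FC 1.05, fails fold change
down = is_differential(1.0, 2.0, adj_p[3])    # FC 0.5, passes as down-regulated
noisy = is_differential(3.0, 2.0, 0.2)        # fails the p-value criterion
```

Fold changes are computed here on linear-scale means; with log2-transformed data the equivalent test is on the difference of means against log2(1.2).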

Selection of unique peptides of HCC candidate diagnosis biomarkers
A Learning Vector Quantization (LVQ) model [46], implemented with the mlbench and caret packages of R (version 4.0.2), was used to evaluate the accuracy of each HCC-related differentially abundant protein in identifying HCC patients, and candidate proteins with accuracy higher than 0.8 were selected. The ionic characteristics of the peptides matched to candidate proteins were then evaluated as follows: the selected peptides were required to be unique and without modification; the best length of peptides was 8-12 AAs and have at least 5 ions with the

X.L. and X.X. led this project in generating proteomics data, data analysis, data validation and manuscript preparation. Y.W. and Y.Z. performed the collection and provision of clinical samples. L.C., E.H., Z.L. and C.H. contributed to the collection and collation of clinical data. L.C., J.O., F.W. and Z.L. contributed to proteomics sample preparation of serum. X.X., L.C. and J.O. coordinated mass spectrometry data acquisition. X.X. and L.C. performed proteomics data analysis and statistical analysis. L.C. and J.O. constructed the data portal; L.W., J.L. and X.L. supervised the project, and revised and reviewed the manuscript. All the authors contributed to manuscript revision.

DECLARATION OF INTERESTS
No conflicts of interest declared.

Large-scale DIA-based proteomics was used to select HCC-related biomarker candidates, which were then validated in an independent validation cohort using a PRM-based targeted proteomic approach. HCC diagnosis models were constructed based on machine learning, and the efficacy of the models for HCC risk prediction was assessed through prospective long-term follow-up of LC patients.

Proteins identified in the 4 groups were ranked according to their intensity. The top 10 most abundant proteins are labeled, and their relative contribution to the total protein intensity is indicated.

For data acquisition by DIA-MS, equal amounts of iRT peptide standards were spiked into each individual sample, and the basic nano-LC MS/MS parameter settings were the same as those for DDA. In DIA mode, the scheme included a full MS scan from 400 to 1,200 m/z at a resolution of 140,000 (at m/z of 200), and 32 MS/MS scans were acquired at a resolution of 3,500 (at m/z of 200) with a maximum IT of 20 ms. The cycle of 32 MS/MS scans (given as centers of the isolation windows) used two isolation window widths, as follows: 410-990 m/z with 20 m/z wide windows and 1050-1150 m/z with 100 m/z wide windows. The dynamic exclusion time was set to 20 s.
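The DIA window scheme described above can be checked with a short sketch. It assumes that the stated m/z values are window centers, that consecutive centers are spaced by the window width, and that the narrow windows are 20 m/z wide; the function name is hypothetical. Under those assumptions the scheme yields exactly 32 contiguous windows covering 400 to 1,200 m/z, matching the full-scan range.

```python
def build_windows(first_center, last_center, width):
    """List (lower, upper) isolation windows for evenly spaced centers."""
    windows = []
    center = first_center
    while center <= last_center:
        windows.append((center - width / 2, center + width / 2))
        center += width
    return windows


# 20 m/z wide windows centered at 410, 430, ..., 990 m/z.
narrow = build_windows(410, 990, 20)
# 100 m/z wide windows centered at 1050 and 1150 m/z.
wide = build_windows(1050, 1150, 100)
scheme = narrow + wide

# Each window should start where the previous one ends.
contiguous = all(a[1] == b[0] for a, b in zip(scheme, scheme[1:]))
```

Real acquisition methods often add a small overlap between neighboring windows; none is stated in the source, so none is modeled here.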

Figures

Figure 1 Overall

Figure 2 MS

Figure 3 Quality

Figure 7 Performance